For my project, I used k-Means clustering on a dataset of statistics from the 2016 primaries. I first processed the dataset to keep only Donald Trump's results in each county. I then normalized this data and ran a k-Means algorithm on it to see what groups of people voted for him.
In [85]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
Importing the .csv files and processing them into a single dataframe.
In [86]:
# process the .csv containing county statistics
counties = pd.read_csv('county_facts.csv')
drop_columns = ["state_abbreviation", "fips"]
counties.drop(drop_columns, inplace=True, axis=1)
# combine it with the .csv containing primary statistics
# (pd.concat with axis=1 aligns the two dataframes by row index)
primary = pd.read_csv('primary_results.csv')
primary = pd.concat([primary, counties], axis=1)
trump = primary[primary['candidate'] == 'Donald Trump'].sort_index()
# drop the features we don't need
drop_columns = ["state_abbreviation", "party", "candidate", "area_name"]
trump.drop(drop_columns, inplace=True, axis=1)
# get rid of counties with no statistical data
trump = trump.fillna(0.0)
trump = trump[trump['POP010210'] > 0]
trump.head()
Out[86]:
This creates a dataframe containing Trump's results for every county with statistical data. Now the data has to be normalized.
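Each feature is rescaled to the range [0, 1] with min-max normalization, i.e. (x - min) / (max - min). The next cell applies this inline to each feature; the same idea could be written as a small helper function (a hypothetical min_max_norm, shown here for illustration and not used below):

def min_max_norm(series):
    # scale a column to [0, 1] and reshape it into a column vector for stacking
    scaled = (series - series.min()) / (series.max() - series.min())
    return np.array(scaled).reshape(-1, 1)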
In [87]:
state = trump["state"]
county = trump["county"]
# any of the features in the trump dataframe can be used, these were chosen because they seemed interesting
# percent who voted for donald trump
fraction_votes = trump["fraction_votes"]
fraction_votes_norm = np.array((fraction_votes - fraction_votes.min()) / (fraction_votes.max() - fraction_votes.min())).reshape(-1,1)
# median household income of the country
median_income = trump["INC110213"]
median_income_norm = np.array((median_income - median_income.min()) / (median_income.max() - median_income.min())).reshape(-1,1)
# percent of people in the county who were born outside of america
foreign_born = trump["POP645213"]
foreign_born_norm = np.array((foreign_born - foreign_born.min()) / (foreign_born.max() - foreign_born.min())).reshape(-1,1)
# percent of people in the county who graduated high school
high_school = trump["EDU635213"]
high_school_norm = np.array((high_school - high_school.min()) / (high_school.max() - high_school.min())).reshape(-1,1)
# percent of people in the county with a bachelors degree
bachelors = trump["EDU685213"]
bachelors_norm = np.array((bachelors - bachelors.min()) / (bachelors.max() - bachelors.min())).reshape(-1,1)
# the features to be used in k-Means are added to 2-D arrays
trump_norm = np.hstack((high_school_norm, median_income_norm))
Graphs showing the relationships between some of these features and the election results are displayed below.
In [88]:
# graphs of the normalized data
f, axarr = plt.subplots(2, 2)
axarr[0,0].set_title('Income and Trump Votes')
axarr[0,0].scatter(median_income_norm, fraction_votes_norm, c='red')
axarr[0,1].set_title('Foreigners and Trump Votes')
axarr[0,1].scatter(foreign_born_norm, fraction_votes_norm, c='green')
axarr[1,0].set_title('College and Trump Votes')
axarr[1,0].scatter(bachelors_norm, fraction_votes_norm, c='blue')
axarr[1,1].set_title('High School and Trump Votes')
axarr[1,1].scatter(high_school_norm, fraction_votes_norm, c='yellow')
plt.setp([a.get_xticklabels() for a in axarr[0, :]], visible=False)
plt.setp([a.get_yticklabels() for a in axarr[:, 1]], visible=False)
plt.show()
With the data normalized, the k-Means algorithm can be run on it. To find a good number of clusters, the average silhouette score was calculated for each candidate value.
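For reference, the silhouette score of a single sample $i$ compares $a(i)$, its mean distance to the other points in its own cluster, with $b(i)$, its mean distance to the points in the nearest other cluster:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$

silhouette_score averages $s(i)$ over all samples, so values near 1 indicate compact, well-separated clusters.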
In [89]:
best_nc = 0
best_ss = 0
for n_clusters in range(2, 10):
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(trump_norm)
    silhouette_avg = silhouette_score(trump_norm, cluster_labels)
    print("For", n_clusters, "clusters, the average silhouette score is", silhouette_avg)
    if silhouette_avg > best_ss:
        best_nc = n_clusters
        best_ss = silhouette_avg
print("The best number of clusters is", best_nc)
With the best-scoring number of clusters, the final model can be fit.
In [90]:
kmeans = KMeans(n_clusters=best_nc, random_state=10)
kmeans.fit(trump_norm)
Out[90]:
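Since silhouette_samples was imported earlier, the per-sample scores for the fitted model can also be inspected to see how tightly each cluster holds together. A minimal sketch, assuming the kmeans model and best_nc from the cells above are in scope:

# per-cluster mean silhouette for the fitted model (illustrative, not part of the original analysis)
labels = kmeans.labels_
sample_scores = silhouette_samples(trump_norm, labels)
for i in range(best_nc):
    # low per-cluster means flag loosely grouped clusters
    print("Cluster", i, "mean silhouette:", sample_scores[labels == i].mean())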
The results of the k-Means algorithm are plotted below. The plotting code is adapted from the scikit-learn documentation.
In [91]:
h = 0.02
x_min, x_max = trump_norm[:, 0].min() - 1, trump_norm[:, 0].max() + 0.5
y_min, y_max = trump_norm[:, 1].min() - 1, trump_norm[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# obtain labels for each point in the mesh, using the last trained model
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
# put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')
plt.plot(trump_norm[:, 0], trump_norm[:, 1], 'k.', markersize=2)
# plot the centroids as white X's
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=100, linewidths=3,
            color='w', zorder=10)
plt.title('k-Means Clustering on Primary Results')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()
The two clusters at the top and bottom are spread out and include many outliers. The middle cluster is the densest and likely represents the typical Trump-supporting county. From this, it appears that counties with Trump supporters tend to have average to below-average median household incomes and average to slightly below-average high school graduation rates.
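As a possible next step, the state and county columns saved earlier could be joined back to the cluster labels to see which counties fall in each group. A minimal sketch, assuming the variables from the cells above are still in scope:

# attach cluster labels back to county names (illustrative sketch)
results = pd.DataFrame({
    'state': state.values,
    'county': county.values,
    'cluster': kmeans.labels_,
})
# show a few example counties from each cluster
for i in range(best_nc):
    print(results[results['cluster'] == i].head(3))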